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Abstract 

In  this  paper,  we  consider  the  parallel  solution  of  recurrences, 
and  linear  systems  in  the  regular  algebra  of  Carre.   These  problems  are 
equivalent  to  solving  the  shortest  path  problem  in  graph  theory,  and  they 
also  arise  in  the  analysis  of  Fortran  programs.   Our  methods  for  solving 
linear  systems  in  the  regular  algebra  are  analogs  of  well-known  methods 
for  solving  systems  of  linear  algebraic  equations.   A  parallel  version  of 
Dijkstra's  method,  which  has  no  linear  algebraic  analog,  is  presented. 
Considerations  for  choosing  an  algorithm  when  the  problem  is  large  and 
sparse  are  also  discussed. 


1.        Introduction 

A  basic  problem  in  applications  of  graph  theory  is  that  of 
finding  shortest  paths  in  a  weighted  graph.   This  problem  arises  in 
finding  the  shortest  or  least  cost  path  in  transportation  problems,  in 
solving  minimum  cost  flow  problems,  in  finding  the  critical  (i.e., 
longest)  path  in  scheduling  problems,  and  in  finding  adversary  routes 
through  nuclear  fuel-cycle  facilities  [13,  23]. 

While  there  are  several  versions  of  the  shortest  path 
problem  [8],  we  will  deal  with  the  problem  of  finding  the  shortest  paths 
in  definite  graphs  G  from  one  node  to  all  others,  where  the  path  length 
used  is  the  sum  of  the  weights  of  the  arcs  in  the  path.   A  digraph  G(V,A) 
is  a  set  V  of  nodes  and  a  set  A  of  arcs  which  are  ordered  pairs  of  nodes. 
Assume  that  the  nodes  are  numbered  from  1  to  n  so  that  V  =  {1,  2,  ...,n}  . 

Each  arc  (i,j)  £  A  has  an  associated  length,  6.  .  .   By  defining  6    =  °° 

1  >  3  !  >  i 

1  <  i  <  n     and  6.  .  =  °°  for  (i,i)  t  k   ,  we  have  the  n  by  n  distance 
-   -  i»J 

matrix  D  =  [6.  .]  .   If  a  graph  G  contains  no  cycles  the  sum  of  whose 
1  >  J 

arc  weights  are  less  than  or  equal  to  zero,  then  we  say  that  G  is 
definite. 

Although  extensive  work  has  been  done  on  sequential  algorithms 
for  solving  the  shortest  path  problem,  little  work  has  been  done  [4,  6,  19] 
on  parallel  algorithms  for  this  problem.   Levitt  and  Kautz  [19]  discuss 
an  iterative  algorithm  due  to  Hu  [12]  which  has  been  tailored  to  a 
cellular  array  machine.   Chen  and  Feng  [4  ]  discuss  a  version  of  Ford's 
algorithm  [9,  10]  for  an  associative  processor  machine.   Dekel  and  Sahni 
[6  ]  present  a  method  for  cube  connected  and  perfect  shuffle  computers. 
Here,  however,  we  treat  the  shortest  path  problem  in  a  more  general 


way,  using  the  regular  algebra  of  Carre"  [2].   Furthermore,  we  will 
assume  that : 

(i)    any  number  of  processors  may  be  used  at  any  time, 

but  we  will  give  bounds  on  this  number, 
(ii)   each  processor  may  perform  either  a  comparison  or 
any  of  the  four  arithmetic  operations  in  one  time 
step,  and 
(iii)  there  are  no  memory  or  data  alignment  time  penal- 
ties. 
If  p  processors  are  being  used,  then  we  denote  the  computa- 
tion time  as  T  unit  steps.   Thus  T   is  the  time  required  by  a  serial 
machine.   We  define  the  speedup  of  a  computation  using  p  processors 
over  the  serial  computation  time  as  Sp=T  /T  <_  1. 

Carre"  [2],  has  shown  that  the  shortest  path  problem,  as  well 
as  many  other  problems,  can  be  posed  as  linear  systems  of  the  form 
x=Ax  +  b  in  a  regular  algebra.   He  gives  the  examples  of  the  scheduling 
algebra  of  Cruon  and  Herve  [5],  the  two  elements  boolean  algebra,  and  a 
stochastic  communication  problem  (Kalaba  [16];  Moisil  [20]).   Triangular 
linear  systems,  or  recurrences,  in  a  regular  algebra  also  arise  in  the 
analysis  of  Fortran  programs  [17].   Consequently,  it  is  important  to 
solve  efficiently  such  recurrences  on  a  parallel  computer. 

In  addition  to  considering  the  solution  of  linear  systems  in 
a  regular  algebra,  we  will  present  a  parallel  version  of  Dijkstra's  al- 


gorithm  [7]  which  is  applicable  only  to  graphs  with  positive  arc  weights. 
Furthermore,  considerations  for  choosing  an  algorithm  when  the  graph  is 
sparse,  i.e. ,has  a  large  number  of  infinite  entries  in  the  distance 
matrix,  are  discussed. 

1.1.  Notation 

We  follow  the  convention  that  capital  letters  denote  matrices, 
lower  case  letters  denote  vectors,  and  lower  case  greek  letters  denote 
scalars.   Except  in  the  cases  where  we  state  time  and  processor  bounds, 
the  symbols  +  and  x  are  taken  to  represent  the  generalized  addition  and 
generalized  multiplication  operations  of  the  regular  algebra.   In  the 
statement  of  the  time  and  processor  bounds,  the  symbols  +  and  *  will 
take  on  their  normal  meaning.   We  will  frequently  omit  the  symbol  x 
when  it  is  clear  that  a  generalized  product  is  to  be  taken.   We  will  al- 
so use  the  Z  and  It  notations  for  the  generalized  sum  or  product  of  a  set 
of  items.   Unless  otherwise  stated,  log  n  =  flog~nl  for  any  positive  num- 
ber n. 

1.2.  The  Regular  Algebra 

For  convenience,  we  restate  the  algebra  of  Carre.   We  start 
by  defining  a  semiring  (S,+,x)  which  satisfies  the  following  properties: 

(i)    commutativity         a  +  3  =  3  +  a 

a  x  3  =  g  x  a 

(ii)    associativity         a  +  (3  +  y)    =    (a  +  3)  +  Y 

ax($XY)  =  (ax£)XY 

(iii)   distributivity        a  x  (3  +  y)    =    (a  x  3)  +  (a  x  y) 

(iv)    idempotency  a  +  a  =  a 

for  a,  3,  Y,  e,  S. 


The  set  S  has  a  unit  element  e  such  that 

a  x  e  =  a, 

and  a  null  element  9  satisfying 

0  +  9=0,  a  *  9  =  9, 
for  all  a  e  S.   Furthermore,  we  have  the  law  of  multiplicative  cancellation: 
(v)   if  a  4-   9  and  a  x  3  =  a  x  y   then  3  =  y. 

We  define  for  the  semiring  (S,+,x)  the  order  relation  <_: 

a  <_  3  if  and  only  if  a  +  3  =  a. 
1.3.      Extensions  to  Matrix  Operations. 

We  now  extend  the  definitions  of  the  regular  algebra  to  matrices 
all  of  whose  elements  belong  to  the  set  S.   The  definitions  of  matrix 
addition,  matrix  multiplication  and  matrix  transposition  are  analogous 
to  those  in  linear  algebra.   Note  that  matrix  addition  is  idempotent, 

A  +  A  =  A. 
We  define  a  square  m  by  m  unit  matrix  I  =  [e.  .1  with  e.  .=  £ 
if  i  =  j  and  e .  .  =  9  if  i  ^  j  .   Note  that  IA  =  AI  =  I  for  any  m  by 
matrix  A.   The  null  matrix  N,  all  of  whose  elements  are  6,  satisfies 

A  +  N  =  A 
A  x  N  =  N. 
The  partial  ordering  for  matrices  is  easily  extended: 

A  <  B  if  and  only  if  a .  .   <   3.  .   for  all  i,j. 


m 


In  particular,  it  follows  that 

A  <_   B  if  and  only  if  A  +  B  =  A. 

The  shortest  path  problem  can  now  be  stated  as  a  linear 

system  in  the  regular  algebra.   Let  G  (V,  A)  be  a  graph  with  distance 

matrix  D.   Let  the  vector  b  represent  an  initial  labeling  of  the  nodes 

V  (for  the  vector  b,  3.  is  the  inital  label  on  node  i) .   The  shortest 

path  problem  is  equivalent  to  solving  the  linear  system  x  =  Dx  +  b  in  the 

regular  algebra  [2].   For  the  single  source  problem  the  vector  b  is  given  by 

by  8  =  e  where  s  is  the  source  node,  and  3.  =  6  otherwise, 
s  1 

2 .        Solution  of  Recurrences 

In  this  section  we  study  the  solution  of  systems  of  the  form 
x  =  Lx  +  f,  (1) 

where  the  matrix  L  is  a  strictly  lower  triangular  matrix.   These  systems 
arise  in  the  analysis  of  Fortran  programs,  and  are  also  used  in  methods 
for  solving  general  systems 

x  =  Ax  +  b  . 


2.1.      The  Column  Sweep  Algorithm 

The  most  fundamental  of  all  algorithms  for  solving  recurrences 
in  the  regular  algebra  is  the  column  sweep  algorithm.   The  solution  of 
x  =  Lx  +  f  ,  where 


L  = 


X2,l   9 
A3,l   X3,2 


11,1   n,2    n,n-l 


X   = 


f  = 


,   (2) 


is  given  by 


5l  -^  , 


n 


j-l 

?.  -  I        \.    .      ?,   +  4>.  .    1  -  2,3,.. .,n 
J         j,i   i      j 

i=l 


(3) 


Rewriting  the  system  (3)  as  column  operations  we  get  the 
Column  Sweep  Algorithm: 

1.  Set  x  +■   f 

2.  For  i  =  1,  2,  . .. ,  n-1 


x  ■«-  x  +  C.  £. 
i  i 


(in  parallel) 


In  the  algorithm,  I.    represents  the  ith  column  of  the  matrix 

L.   A  sequential  algorithm  for  solving  recurrence  (1)  requires 

2  2 

n  -  n  operations,  i.e.  T  =  n  -  n.   On  the  other  hand,  the  column 

sweep  algorithm  can  be  performed  in  2 (n-1)  time  steps  using  at  most  n-1 

processors.   This  gives  a  speedup  of  -~  +   0(1)  over  the  sequential 

algorithm  with  a  corresponding  efficiency  of  roughly  1/2. 


2.2       The  Product  Form  of  the  Solution 

In  this  section,  the  solution  x  of  (1)  is  formed  as  a  sum  of 
products.   Each  product  involves  a  power  of  the  matrix  L  multiplied 
by  the  vector  f. 

Lemma  1.   Let  L  be  strictly  lower  triangular,  i.e.  L  has  the  form  given 
in  equation  (2).   Also,  let 


L1  = 


N   N 


T 
LR   N. 


where  R  is  upper  triangular  of  order  n-i,  i.e.  L  is  lower  triangular 

with  i  null  diagonals  in  its  lower  half. 

Proof:   From  the  definition  of  matrix  multiplication  and  the  facts  that 

9+e   =   6,    6  x  a  =  0  for  all  a  e  S,  and  that  L  is  null  on 

the  main  diagonal,  each  multiplication  of  L     by  L  introduces  another 

diagonal  of  null  elements  into  the  product.  ■ 

Corollary  1:   Ln  =  N  .  ■ 

Lemma  2:   The  product  LL-1   i_<j,i  +  j<n  can  be  formed  in  log(n-i-j)  +  1 

steps  using  at  most  p(n-i-j)  processors,  where 
k   i 
p(k)  =   E   £  j  -  k(k+l)(k+2)/6  . 
i=l  j=l 

Proof :   This  result  also  follows  from  the  definition  of  matrix  multi- 
plication.  Since  L  has  i  null  diagonals  and  LJ  has  j  null  diagonals, 
the  longest  dot  product  formed  is  of  length  n-i-j .   This  establishes  the 
time  bound.   The  processor  bound  is  established  by  counting  the  number  of 
multiplications  needed  to  compute  the  product.   This  is  the  stated 
processor  count  since  all  of  the  multiplications  can  be  formed  simultaneously . 
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The  product  form  of  the  solution  is  found  by  substituting  for  x 

in  the  right  hand  side  of  (1),  its  equivalent,  Lx  +  f.   Thus, 

2 

x  =  L  x  +  Lf  +  f. 

After  n-1  such  substitutions,  we  obtain 

x  =  Lnx  +  Ln-1f  +  ...  +  Lf  +  f. 


From  Corollary  1,  L  =  N  so 


x  =  L   f  +  L   f  + 


Lf  +  f. 


Furthermore,  equation  (4)  can  be  written  in  the  form 
n/9       n/  ? 

x  =  (I  +  L   Z)(I  +  L  *)  ...  (I  +  IT)  (I  +  L)  f. 


(A) 


(5) 


Heller's  algorithm  [11]  for  solving  triangular  systems  of 
linear  algebraic  equations,  can  now  be  applied. 
Algorithm  1.   Product  form  of  solution  for  recurrences. 

Set  x0  «-  f 

For  i  =  0,  1,  ...,  v  =  log(n/4) 

x.+1  <-  (I  +  L21)  x. 

2  i+1   o i  o  i 

L    *■  L   L 


x  *■   (I  +  L  )  x 


v+1 


Theorem  1:   Algorithm  1  can  be  performed  in  parallel  in  log  n  +  3  log  n  +  1 

1  3 
steps  using  at  most  -?   n  processors. 

Proof:   Heller  has  already  shown  that  Algorithm  1  can  be  performed  in 

2 
log  n  +  3  log  n  +  1  steps.   The  formation  of  x    and  x  requires  only 

2  2i+1  3 

0(n  )  processors  whereas  the  formation  of  L     requires  0(n  )  processors. 

2 
In  particular,  the  largest  number  of  processors  are  required  in  forming  L  . 

From  Lemma  2,  this  number  is 


p(n-2)  =  -|  (n-2)(n-l)n  <  \  n 


processors. 


2.3       Product  of  Elementary  Matrices  Solution 

The  method  presented  in  this  section  is  an  analog  of  Sameh  and 
Brent's  Algorithm  I  [22].   It  is  superior  to  Algorithm  1  above  since  it 
takes  approximately  half  the  time  while  using  significantly  fewer 
processors  to  solve  the  recurrence  (1) . 


Definition:   M.  is  an  elementary  matrix,  if  it  is  of  the 


form, 


M. 


N 


3+1,  J 


3+2,3 


n,3 


N 


N 


M.  has  all  null  entries  except  in  column  j  below  the  main  diagonal, 

Lemma  3:   Let  M.  and  M.  be  elementary  matrices,  then  M.M.  =  N  if 
13  13 

i  <  i.   Furthermore  if  i  >  i  then  P  =  M.M.  has  all  null  elements 

13 


except  in  column  j  below  the  ith  row. 


10 


Proof:   M.  has  its  first  i  rows  all  null  and  M,  has  all  null  elements 
J  i 

except  in  column  i  below  the  main  diagonal.   Thus  if  1  <  j  all  inner 
products  of  rows  of  M.  with  columns  of  M.  must  be  null,  i.e.   M  M  =  N. 
If  i  >  j  then  the  only  non-null  inner  products  of  P  =  MM.  occur  when 
the  inner  products  of  rows  i+1,  i+2,  ...,  n  of  M.  are  taken  with  col- 
umn j  of  M. .  ■ 
J 

Lemma  4:   Let  M  ,  M  ,  . ..,  M. ,  M    be  elementary  matrices.   Then  the 

product 

(I  +  M.±1)(I+M.)  ...  (I  +  M.)(I  +  M.)(I  +  M.)  ...(I  +  M..J 
J+l       J  112  j+1 

(I  +  M.+1)(I  +  M.)  ...  (I  +  Mx).  (6) 

The  proof  is  by  induction  on  j . 

Basis  step:   For  j  =  1,  we  have 

2 
(I  +  M  )  (I  +  M  )  =  I  +  M  +  M  +  M   . 

From  Lemma  3  and  idempotency 

I  +  M  +  M  +  M   =  I  +  M  +  M  =  I  +  M  , 

establishing  the  basis. 

Induction  step:   Assume  that  the  assertion  holds  for  index  j,  and  let 

M.  -  (I  +  M.)(I  +  M   )  ...  (I  +  M  )(I  +  M  ). 

From  the  inductive  hypothesis  we  have 

(I  +  M.^JC  +  M.)    ...    (I  +  M,)(.I  +  M.)(I  +  M_)    ...    (I  +  M...)    = 
J+l  3  112  j+l 

(I  +  M.,n)    M.  (I  +  M.,.). 

3+1        J  3+1 


11 


Furthermore, 


M.(I  +  *!._,,)  =  M.  +  M.M.  in  .  (7) 

J      J+l     J    J  3+1 


By  repeated  applications  of  Lemma  3  we  obtain 


hence  (7)  becomes 


Consequently, 


mjVi  ■  Vi  • 


M.(I  +  M.,.)  =  M.  +  M.j.  . 
J      J+l     1     J+l 


(I  +  M.,  .)M.(I  +  M., .)  -  (I  +  M.j_.)(M.  +  M., .) 
j+l  j      j+l  j+l   j     j+l 

=  M.  +  M.  ,.M.  +M.M  +  M.  ,, 
j     j+l  j     j+l     j+l 

=  M.  +  M.  ,.M.  +  M.  ., 

=  M.  +  M. ,, (M.  +  I) . 
J     J+l  J 

Since  I  is  an  element  of  the  expanded  sum  of  M.,  then  due  to  idempo- 

tency,  we  have 

(I  +  M.,,)M.(I  +  M.,_)  =  M.  +  M.j^M. 
J+l  J      J+l     J    J+l  J 

=  (I+Mj+1)M., 

establishing  the  lemma.  I 

In  a  similar  fashion,  it  can  be  shown  that 
(I+M1)(I+M2)  ...  (I  +  M.+1)(I  +  M.+1)(I  +  M.)  ...  (I  +  Mx)   = 

(I  +  M.+1)(I  +  M.)  ...  (I+M^.  (8) 


Let 


(I  +  L)   =  (I  +  M  _)(I  +  M  _)  ...  (I  +  M0)(I  +  M.)  .         (9) 
n-1       n-Z  L  1 


12 


Since 

(I  +  L)  =  (I  +  M.)(I  +  M_)  ...  (I  +  M   J  (I  +  M   .), 

1       l  n-z       n-1 

from  Lemma  4  and  equation  (8) ,  we  see  that 

*  *  *         (10) 

(I  +  L)  (I  +  L)  -  (I  +  L)   =  (I  +  L)(I  +  L)  .        y      } 

We  now  introduce  a  lemma  of  fundamental  importance  to  what 
follows . 

Lemma  5:   The  vector  x  solves  the  system  x  =  Ax  +  f  if  and  only  if  x 

solves  the  system  x  =  x  +  Ax  +  f . 

Proof:   If  x  =  Ax  +  f  then  adding  x  to  both  sides  and  using  idempotency 

gives 

x+x=x+Ax+f, 
x  =  x  +  Ax  +   f . 
Hence, 

x+Ax+f£Ax+f, 
i.e.,  either  x  =  x  or  x  =  Ax  +  f .   Since  x  =  x  is  redundant,  x  =  Ax  +  f . 

The  following  result  was  originally  given  by  Carre"  [2],  but 
we  state  and  prove  it  in  terms  of  elementary  matrices. 
Theorem  2:   Let  x  =  (I  +  L)  f ,  then  x  satisfies  x  =  Lx  +  f . 
Proof:   From  Lemma  5,  x  =  Lx  +  f  is  equivalent  to  x  =  (I  +  L)x  +  f ,  or 

x  =  (I  +  L)(I  +  L)*f  +  f , 
=  (I  +  L)*f  +  f. 

■k 

From  equation  (9)  f  is  an  element  of  the  expanded  sum  of  (I  +  L)  f ,  thus 
x  =  (I  +  L)  f  satisfies  (11). 


13 


From  Theorem  2,  we  can  write  the  solution  x  to  equation  (1) 
as 

x  =  (I  +  L)*f  =  (I  +  Mn_1)(I  +  Mn_2)  ...  (I  +  M2)(I  +  M1)f, 

which  motivates  the  following  algorithm. 

Algorithm  2:   Solution  of  recurrences  in  product  form. 

1.  Set  (I  +  M.)(0)  =  I  +  M  ,  i-1,  ...,  n-1;  f(0)  =  f. 

2.  For  j=0,  1,  ....  v  -  log(n/4) 
Form  (I  +  M1)(j+1)  =  (I  +  M2i+1)(j)(I  +  M2.)(j) 

in  parallel  for  i=l,  2,  ...,  n/2J    -  1  . 
Form  f(j+1)  =  (I  +Ml)(j)f(j) 

3.  Set  x  =  (I  +Ml)(v)f(v). 


Algorithm  2  is  a  direct  analog  of  Algorithm  I  of  Sameh  and 
Brent  [22]  for  triangular  linear  algebraic  equations.   Hence,  their 
results  (Theorem  2.1)  apply.   The  solution  x  to  the  recurrence 

(1)  can  be  found  in 

12    3 
T  -  —  log  n  +  2   log  n  +  3 

steps  using  no  more  than 

tt  =  (15/1024)n3  +  0(n2)  =  n3/68  +  0(n2) 


processors, 
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2.4 


Limited  Parallelism 


In  this  section  we  present  two  methods  which  are  analogs  of 
those  of  Hyafil  and  Kung  [15]  for  solving  the  recurrence  (1).   These 
methods  are  used  when  the  number  of  processors  p  is  fixed. 

The  first  method  utilizes  the  algorithm  decomposition  applied 
to  the  column  sweep  algorithm: 

,«   -  £ 

(i+1)    ,_     \„(i)   *  i   «,        i 

x      =  (I  +  M.)x    ,  i=l,  2,  ...,  n-1. 

It  is  easy  to  see  that 


(I  +  M.)x 


(i) 


(i) 


i 


(i) 


5i+i(1>  +  W  ?i 


(i> 


n        n,i     i 


(i) 


Thus,  given  p  processors,  the  product  (I  +  M.)x    can  be  formed  in 


n-i 


steps.   Since  T  (n)  =  £   2 
p      i=l 


n-i 


we  have  from  Hyafil  and  Kung 


[15]  that 


T  (n)  <  -  +  (2  -  -)n  +  -  -  2 
P    ~  P         P     P 


This  method  is  most  practical  when  p  <  n, 


Let 
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The  second  method  uses  the  problem  decomposition  principle, 


L  = 


'1,1 


L2,l    L2,2 


Jm ,  1     m ,  2 


N 


m,m 


x. 


m 


f  = 


m 


then  equation  (1)  can  be  written  as, 


Xl  =  L1,1X1  +  fl' 


X2  =  L2,1X1  +  L2,2X2  +  f2» 


m 

x  =  (.Z.  L   .x.)  +  f  , 
m   j=l  m,j  j     m 


This  leads  to  the  following  algorithm. 


Algorithm  3:   Recurrence  solver  with  limited  number  of  processors. 
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taking 


then 


For  1=1,. 
Set  f(i) 


Form  x . 
l 


,  m 
1-1 

Z   L.  .x. 
j-1   X'J  J 


(I  +  L   .)  f 
1,1 


*  (i) 


using  Algorithm  2 


Hyaf il  and  Kung  have  shown  that  for  p  =  fn  1,  1 <  r  <  3,  and 


m  = 


68n 

r  r  i3 


(o(n2  rlog  n)  ,   for  1  <  r  < 

T  (n)  <  J  r 

p    ~  I    1-T   2  3 

^0(n  Jlog  n)  ,   for  y  <  r  < 


3 .   Banded  Recurrences 

In  this  section,  we  take  the  matrix  L  of  (1)  to  be  banded,  i.e., 
of  the  form 


L  = 


N 


Rk-1   Lk  J 


(12) 


where  L.  is  a  strictly  lower  triangular  matrix  of  order  m,  i  =  l,2,...,k, 


and  R.  is  an  upper  triangular  matrix  of  order  m,  j  =  l,2,...,k-l. 
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3.1  Unlimited  Parallelism 

We  can  show  that  the  time  and  processors  required  for  solving  (1) 
are  the  same  as  given  by  Sameh  and  Brent  [22].   The  proof  is  exactly  as 
that  of  their  theorem  3.1.   Before  this  can  be  proved,  however,  we  need  to 
establish  the  following. 
Lemma  6 : 

Let  L  be  a  strictly  lower  triangular  2n  by  2n  matrix  given  by 


L  = 


N 


R    L 


2-1 


(13) 


where  R  is  n  by  n  and  upper  triangular, 
partitioned  as 


fT  =  (fT  fT)  . 


Thus,  the  solution  of  the  linear  recurrence 


Let  f  be  a  2n-vector  correspondingly 


x  =  (I0  +L)x  +  f 
zn 


(14) 


is  given  by 


x  = 


I  2  J 


I    N 
n 


G    I 


n„ 


Ly2. 


(15) 


where 


G  =  (I  +LJ  R  , 
n  2 


and 


?i  =  W  f±  •     * = 1>2 
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Proof; 

From  the  structure  of  L  and  Theorem  2,  we  see  that 

Xl  ■  (In+Ll>\  ■  "l  • 

Also,  from  (13)  and  (14)  we  have 

x.  =  Rx.  +  (I  +L0)x0  +  f   . 
2     1     n  2   2    2 


From  Theorem  2,   we  have 


x2   -    (I  fL/lR  x  +f    ] 


•  %  *  y      , 


proving  the  lemma.  ■ 

Now  we  state  the  main  result. 

Theorem  3:   The  linear  recurrence 

x  =  (I+L)x  +  f  , 

where  L  is  an  n  by  n  strictly  lower  triangular  matrix  (12),  i.e.,  n  =  km, 
is  obtained  in  (3  +  2  log  m)  log  n  -  (1/2) (log  m) (1  +  log  m)  time  steps 
requiring  less  than  m(m+l)n/2  processors. 

Proof; 

T     T  T      T 
Let  f  =  (f  ,f _, . . . ,f  ) .   The  algorithm  of  Lemma  6  can  be 

generalized  to  yield  a  scheme  which  is  exactly  similar  to  that  of  Sameh 

and  Brent  [22],  Theorem  3.1  as  follows. 


Algorithm  4:   Banded  recurrences. 

(Q)  * 

Step  1:   Set  y  v  J    =    (I+L  )  f^  and 
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[Gi-l,yi°)]  =  (I+Li)*[Ri-l,fi]'  1=2>  3' 


. . ,  n/m. 


Step  2:   For  j=l,  2,  . ..,  v  =  log(n/m)  -  1, 


Set  r  -  2Jm 

Set  G.(j+1)  - 

1 

N 
N 

r(j) 
G2i 

r(j)   r(j) 
G2i+1  G2i 

,  i"lf  2, 


'  2r 


-  1 ,  and 


>'i 


(J+D  _ 


y(j) 

y2i-l 

G(j)    (J)   +y(3) 
G2i-1  Y2i-1   Y2i 


,  i=l,  2, 


n 
'  2r 


Upon  termination,  y       contains  the  solution  vector  x. 


3.2        Limited  parallelism 

In  this  section  we  will  assume  that  m<p<<n,  where  p  is  the 
number  of  processors  available.   If  p=m  we  can  use  the  column  sweep 
algorithm  to  solve  the  recurrence  in  2(n-l)  steps.   We  will  develop  a 
faster  algorithm  for  p>m  along  the  same  lines  as  the  algorithm  of 
Theorem  2.3.13  of  Kuck  and  Sameh  [18],  for  solving  banded  triangular 
systems  of  linear  algebraic  equations.   Our  results  are  summarized  by  the 
following  theorem. 

Theorem  4:   Given  p  processors,  m<p<<n,  a  linear  recurrence  with  the 
matrix  L  of  the  form  (12)  can  be  solved  in  time 


T  =  2(m-l)  +  t  [(n-m)/p(p+m-l)~[  , 
P 


(16) 
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where 


t  =  max 


(2m2+3m)p  -  (m/2) (2m2+3m+5) 
2m(m+l)p  -  2m 


(17) 


This  yields  a  speedup  and  efficiency  proportional  to  (p/m)  and  (1/m)  re- 
spectively. 
Proof;   To  motivate  the  approach,  consider  a  recurrence  of  the  form 

z  =  Lz  +  g  of  order  s  =  q  +  p(p-l) 

equations  with  q  =  mp, 


Ro  Li 


Rl   L2 


R  ,   L 
p-1   p 


~zo~ 

zl 

8l" 

Z2 

g2 

• 

+ 

• 

z 
P 

8p 

(18) 


Here,  L  is  of  order  q,  L.(i>l)  is  of  order  p,  and  R.(0<i<p-1)  con- 
tains non-null  elements  only  on  its  top  m  super  diagonals,  i.e., 

N   R 


R.  = 

i 


N  J 


(19) 


in  which  R  is  upper  triangular  of  order  m.   The  system  (18)  can  be  expressed 


as 


zi  =  Ro    zo  +  Lizi  +  8r 


z2  -  Rx  Z±   +  L2z2  +  g2, 


z   =  R  nz  n+L  z  +  g  , 
P    P-1  P-1   P  P    P 
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From  Theorem  2,  we  get  that  the  z.'s  can  be  written  as 


* 

z  =  (I+Lj)  tRj-lzj_i  +  8j]'  J=1'  •*•'  P'   or 

Zj  =  ((I+L)*Ej_1)zJ_1  +  (I+L)*gr  (2Q) 


Therefore,  we  define 


h±  =    (I+L1)*(R()z0  +  §1),  (21) 


[Gi_1,h.]  =  a+^±)*[\_rl»g±\,    i=2,  3,  ...,  p.      (22) 


2 
Using  a  single  processor,  equation  (21)  can  be  evaluated  in  t  =2m  p 

steps.   Using  a  single  processor  to  evaluate  each  of  the  p-1  systems, 

equation  (22)  can  be  evaluated  in 

x2  =  [(2m2  +  m)p  -  (m/2)(2m2  +  3m  -  1) ] 
steps. 

Kuck  and  Sameh  [18]  have  shown  that  the  choice  of  q=mp  keeps 
the  difference  in  time  for  evaluating  equations  (21)  and  (22)  as  small 
as  practically  possible.   Thus  the  evaluation  of  equations  (21)  and 
(22)  can  be  done  in  at  most  t  =  max{x  ,T  }  steps. 

We  now  see  from  equations  (20) ,  (21) ,  and  (22)  that  z  is  giv- 
en by 
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Zl  =  hl 


z±   =  h±  +  Gi_12i_1,  i=2,  3, 


Since  only  the  last  m  columns  of  G  are  non-null,  then  by  using  all 

of  the  p  processors,  each  z  can  be  computed  in  2m  time  steps.   Thus, 

z_,  z„,  ...,  z  are  all  obtained  in  time 
2   3        p 

t^  =  2m(p-l) , 


and  hence  the  recurrence  z  =  Lz  4-  g  can  be  solved  in  time 


T  =  T3  +  TA 


=  max 


f(2m2+3m)p  -  (m/2)  (2m2+3m+5)  , 
(2m(m+l)p  -  2m. 


Partition  the  recurrence  x  =  (I+L)x  +  f  of  n  equations 
and  bandwidth  m+1  into  the  form 


x. 


*k 


I+V, 


ui  I+Vi 


\    I+\ 


—               — 

xo 

fo 

xl 

£1 

. 

+ 

. 

• 

• 

• 

• 

xk 

£k 

—              — 

—             — 
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where    I+V   is  of  order  m,  and  I+V.  ,  i  >  0  (except  possibly  for  I+Vfc) 
is  of  order  s,  i.e.,  k  =  [  (n-m)/s  |,  and  each  U.  is  of  the  form  (19). 
Then,  the  solution  vector  x  is  given  by 


xQ  -  (I+V0)  f0  (23) 


x.  =  (I+V.)  (f.  +  U.x.  ,),  i=l,  2,  ...,  k.        (24) 
l       ill  l-l 


Equation  (23)  can  be  evaluated  in  2(m-l)  steps  using  the  column  sweep 
algorithm  since  p>m.   The  k  equations  in  (24)  can  be  solved  one  at  a 
time  using  the  algorithm  developed  for  solving  recurrence  (18) .   Hence 
using  p  processors,  the  recurrence  (12)  of  order  n  and  bandwidth  m=l 
can  be  solved  in  time 

T  =  2(m-l)  +  x  [""(n-m)/p(p+m-l)~[, 

where  i  is  given  by  equation  (17) .  ■ 

4.         Linear  Systems   x  =  Ax  +  b . 

In  this  section,  we  will  deal  with  the  general,  linear  sys- 
tem 

x  =  Ax  +  b.  (25) 

In  section  1.3,  we  saw  that  the  solution  of  the  shortest  path 
problem  can  be  obtained  by  solving  the  system  (25)  .   Due  to  the  form  of 
(25),  the  n  by  n  matrix  A  has,  in  general,  its  only  null  elements  on 
the  main  diagonal.   The  matrix  A  is  called  sparse  if  it  has  a  small  pro- 
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portion  of  non-null  elements.   If  we  let  n  be  the  number  of  non- 

A 

2 
null  elements  of  A,  then  we  say  that  A  is  sparse  if  n  <<n  . 

We  reiterate  the  assumption  that  the  matrix  A  is  defi- 
nite, that  is, the  associated  graph  G  has  no  cycles,  the  sum  of 
whose  arc  weights  are  less  than  or  equal  to  zero.   Now  we  present 
two  direct  methods  and  two  iterative  methods  for  solving  the  sys- 
tem (25).   Since  the  Matrix  A  is  definite,  the  iterative  methods 
can  also  be  posed  as  direct  methods.   We  also  discuss  the  special 
case  when  all  of  the  arc  weights  of  the  underlying  graph  G  are  po- 
sitive.  The  section  is  concluded  with  a  comparison  of  the  methods 
when  the  matrix  A  is  sparse. 
4.1       Elimination  Methods  (Carre"  [2],  §7.2,  7.3) 

The  method  of  generalized  Gaussian  elimination  was  first 
proposed  by  Carre  [2,3].   We  state  the  method  and  give  time  and 
processor  counts  for  a  parallel  implementation. 

Algorithm  5:   Parallel  Gaussian  Elimination 

For  k=l,  2,  . . . ,  n-1 

Set  a.=a.,x   a    +  a.  .\ 

1,3    1,k     k>3  1,J  i      in  parallel  for  (i,j=k+l, 

3i   =  ai  k  X  6k   +  3i   )   k+2'  "•'  n;  ^ 
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The  resulting  upper  triangular  system  can  be  solved  using  a  meth- 
od of  Section  2.   The  reduction  of  the  matrix  A  to  upper  triangular 

2 
form  can  be  accomplished  in  2(n-l)  steps  using  at  most  (n-1)   proces- 
sors. 

The  added  time  and  perhaps,  added  processors  necessary  to 
solve  the  resulting  linear  recurrence  of  Gaussian  elimination  can  be 
avoided,  by  using  the  generalized  Jordan  elimination  method  of  Carre 
[2,  §7.3]. 

Algorithm  6:   Parallel  Jordan  elimination 

For  k=l ,  2 ,  . . . ,  n-1 

Set  a.  .  =  a    x  a   .  +  a.  . 

i,j    i,k    k,j    i,j   in  parallel  for  1=1>  ##<>  n; 

6.   ...  .«(.   +  B.   J-"*.  *« »«**>■ 

x      i,k    k      i 


Note  that  upon  termination,  the  vector  b  contains  the  solution  x.   Clear- 

2 
ly,  Algorithm  6  can  be  performed  in  2 (n-1)  steps  using  at  most  n  +1  pro- 
cessors.  Thus,  we  see  that  by  using  an  additional  2n  processors  over 
Gaussian  elimination,  we  can  solve  the  system  (25)  in  the  same  time  as 
it  takes  to  do  the  elimination  step  of  Gaussian  elimination. 

4.2        Iterative  Methods 

In  this  section  we  present  two  iterative  methods  for  solving 
the  system  (25) .   They  correspond  to  the  Jacobi  and  Gauss-Seidel  methods 
for  the  iterative  solution  of  linear  algebraic  equations.   The  methods  of 
Bellman  [1]  and  Moore  [21]  are  sequential  versions  of  the  generalized 
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Jacob!  method.   Ford  and  Fulkerson's  [10]  sequence  of  scanning  the  arcs 
in  Ford's  method  [9]  is  a  sequential  version  of  the  generalized  Gauss- 
Seidel  method. 

The  system  (25)  can  be  solved  using  the  generalized  Jacobi 

method 

(k)    .  (k-1)  L  . 
xv   =  Ax      +  b.  (26) 

Since  the  graph  G  associated  with  the  distance  matrix  A  is  assumed  to  be 
definite,  from  Theorem  6.1  of  Carre  [2],  for  an  initial  guess  of  a  null 

x    ,  we  see  that  x    =  x. 

(k-1) 

We  observe  that  the  matrix-vector  product  Ax      can  be 

2 
formed  in  parallel  in  log  n  +  1  steps  using  at  most  n  processors. 

2 
Thus  a  single  iteration  requires  log  n  +  2  steps  and  at  most  n  proces- 
sors.  The  overall  algorithm  can  be  accomplished  in  no  more  than 

2 
n(log  n  +  2)  steps  using  n  processors. 

In  our  context,  the  generalized  Jacobi  method  can  be  con- 
sidered  as  a  direct  method.   Since  x    is  the  solution  to  the  system 
(25) ,  we  can  write  it  by  repeatedly  using  equation  (26) ,  as 

(n)    .n  (0)  ,  An-1,    »n-2,         .,  ,  .        /97\ 
x    =  A  x    +  A   b+A   b+...+Ab+b.        k^i ) 

Since  x    was  chosen  to  be  null  ,  the  equation  (27)  simplifies  to 


x 


(n)  =  An_1b  +  An~2b  +  . . .  +  Ab  +  b,  (28) 


which  can  be  evaluated  in  parallel  using  the  following  algorithm. 
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Algorithm  7:   Direct  Jacobi  Method 


Set  x  =  b 


For  i=0,  1,  ...,  v  =  log(-r) 


x.,-  =  (I+A  )x.   (we  can  also  terminate  whenever  x., ,=x.) 
i+l  x  1+1   1 


,i+l 


21  21 

=  A  A 


x  =  (I+A7)x 


v+1    (if  necessary) 


This  is  the  method  proposed  by  Dekel  and  Sahni  [6],  for  cube 

connected  and  perfect  shuffle  computers.   By  summing  the  time  counts  for 

Algorithm  7 ,  and  observing  that  the  maximum  number  of  processors  occurs 

2i+l 
during  the  formation  of  A     ,  we  get  the  following  theorem. 

Theorem  5;   The  solution  to  the  system  (25)  can  be  obtained  in  no  more 

2  3 

than  2  log  n  +  2  logn  -  1  steps  using  n  processors.  ■ 

We  will  now  discuss  a  parallel  implementation  of  the  general- 
ized Gauss-Seidel  method.   We  begin  by  observing  that  the  matrix  A  of 
the  system  (25)  can  be  written  as 


A  =  L  +  U. 


(29) 


Cbnsequently,  (25)  becomes 


x  =  (L+U)x  +b, 


(30) 


which  leads  to  the  generalized  Gauss-Seidel  iteration 


x 


(fcfl)  .    OcH)  +  Ux(k)  +  b- 
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C      1      4  «  (k+1)  •       U 

Solving  for  x      yields 


x(k+l)  m    (I+L)*Ux(k)  +  (I+L)*b 


Mx(k)  +  c.  (31) 


From  formula  (2.16)  of  Sameh  and  Brent  [22],  we  get  that  the 

1    2 
iteration  matrix  M  and  vector  c  can  be  formed  in  —  log  n  +  0(log  n) 

3       2 
steps  using  at  most  —   +  0(n  )  processors.   Thus,  the  solution  x  can  be 

6 

found  by  the  generalized  Gauss-Seidel  method  in  n  log  n  +  0(n)  steps 

n3       2 
using  at  most  -?       +   0(n  )  processors. 

Carre1  [2]  has  shown  that  the  number  of  iterations  required 
by  the  generalized  Gauss-Seidel  method  is  not  greater  than  the  number 
required  by  the  generalized  Jacobi  method.   It  is  thus  likely  that  for 
many  problems,  the  overall  work  for  the  generalized  Gauss-Seidel  meth- 
od will  be  less  than  that  of  the  generalized  Jacobi  method,  when  used 
in  an  iterative  fashion.   This  potential  savings  in  time  has  a  sub- 
stantial cost  in  additional  processors. 

When  used  as  a  direct  method,  the  generalized  Gauss-Seidel 

scheme  would  use  Algorithm  7  applied  to  the  matrix  M  and  vector  c  of 

1    2 
equation  (31).   Note  that  an  additional  —  log  n  +  0(log  n)  operations  are 

required  to  form  the  matrix  M  and  vector  c.   Hence,  for  the  generalized 

Gauss-Seidel  method  to  be  more  efficient  than  the  generalized  Jacobi 

method,  Algorithm  7  applied  to  the  matrix  M  and  vector  c  must  take  at 

least  25%  fewer  iterations  than  the  same  algorithm  applied  to  the  original 

matrix  A  and  vector  b.   Consequently,  one  would  prefer  the  generalized 

Jacobi  method  for  solving  the  system  (25) . 
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4.3       Dijkstra's  Method 

We  now  consider  the  special  case  when  all  of  the  arc  weights 
of  the  associated  graph  G  are  positive.   In  the  regular  algebra  this 
means  that  the  elements  of  the  matrix  A  of  the  system  (25)  satisfy 
a.  .  >  e  for  all  i  and  j.   Due  to  this  added  assumption,  it  is  easy 
to  see  that  there  can  be  no  linear  algebraic  analogs  for  algorithms 
tailored  to  these  problems. 

Dijkstra's  method  [7]  is  such  an  algorithm,  iterative  in  na- 
ture with  no  linear  algebraic  analog.   In  Dijkstra's  method,  nodes  of 
the  graph  G  are  separated  into  two  classes,  temporary  and  permanent. 
All  nodes  start  out  temporary  and  become  permanent  one  at  a  time.   The 
algorithm  terminates  once  all  nodes  have  become  permanent. 

Let  mode  [*]  be  an  array  of  bits  used  to  indicate  whether 
a  given  node  is  temporary  or  permanent.   When  mode[i]=0,  then  node 
i  is  temporary.   If  mode[i]=l,  node  i  is  permanent.   If  we  wish  to 
solve  the  system  (25),  where  a.  .>  e  for  all  i  and  j,  then  Dijkstra's 
algorithm  can  be  stated  as  follows: 
Algorithm  8:   Parallel  Dijkstra's  Algorithm 
The  vector  b  is  overwritten  by  the  solution  x. 
Initialize:   mode[i]=0,  i=l,  2,  ...,  n. 

For  i=l,  2,  . . . ,  n-1 

Find  j  such  that  3.  =   16. 


mode[k]=0 


modetj  ]=1 


(32) 
(33) 


For  all  k  such  that  mode[k]=0 


\    =6.  +6.  *  a.  , 
k    k    j    j,k 


(34) 
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We  observe  that  for  each  iteration,  line  (32)  can  be  done 
in  log(n+l-i)  steps  using  n-i  processors.   Overall,  line  (32)  requires 

n 

E  log  i  =  nlogy 
i=2 

operations  and  at  most  n-1  processors.   Similarly  line  (33)  requires 
one  operation  each  iteration,  and  line  (34)  can  be  performed  in  paral- 
lel at  each  iteration  using  two  operations  and  n-i  processors.   The 

overall  work  for  a  parallel  version  of  Dijkstra's  method  is  nlogn+2n-3 
steps  using  at  most  n-1  processors.   Compared  with  n  iterations  of  the 

generalized  Jacobi  method,  Dijkstra's  method  requires  essentially  the 

2 
same  time,  but  far  fewer  processors  (n-1  vs.  n  ).   Thus,  for  dense  ma- 
trices A,  Dijkstra's  method  appears  to  be  the  iterative  method  of  choice, 
4.4       Sparse  Matrix  Considerations 

When  the  matrix  A  of  the  system  (25)  is  large  and  sparse,  that 

2 
is  the  number  of  arcs  n.  in  the  graph  G  is  much  less  than  n  ,  direct 

methods  are  impractical.   Unless  the  matrix  A  is  banded  or  has  a  spe- 
cial structure,  undesireable  fill-in  of  the  matrix  A  will  occur.   Since 
the  economization  of  storage  for  large  sparse  matrices  is  essential, 
iterative  methods  must  be  used. 

Not  all  iterative  methods  are  appropriate  for  this  problem. 
The  generalized  Gauss-Seidel  method  requires  the  formation  of 

* 

the  iteration  matrix  M  =  (I+L)  U.   This  matrix  in  general  will  be  dense 
and  hence  too  costly  to  store.   We  will  thus  restrict  ourselves  to  a  dis- 
cussion of  the  generalized  Jacobi  method  and  Dijkstra's  method. 

Let  q  be  the  maximum  number  of  non-null  elements  in  any  row  of 
A.   In  order  to  compare  the  two  algorithms,  we  will  use  a  cost  function 
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which  is  the  product  of  the  number  of  processors  and  time.   For  dense 

matrices,  we  see  that  Dijkstra's  method  has  a  cost  of 

2  2 

(n-l)(n  log  n  +  2n  -  3)  =  n  log  n  +  0(n  )  . 

3  3 

Similarly,  the  generalized  Jacobi  method  has  a  cost  of  n  log  n  +  0(n  ) 

which  is  about  n  times  greater  than  that  of  Dijkstra's  method. 

For  sparse  matrices,  the  generalized  Jacobi  method  requires 

nlogq  +  2n  steps  and  n  -n<n(q-l)  processors.   Dijkstra's  method  still 

requires  n  log  n  +  2n  -  3  steps  and  n-1  processors.   So,  the  cost  for  the 

generalized  Jacobi  method  becomes 

2  2 

n(q-l) (nlogq  +  2n)  =  (q-l)n  logq  +  0(qn  ), 

while  the  cost  of  Dijkstra's  method  remains  the  same.   When  the  gener- 
alized Jacobi  method  requires  the  full  n  iterations,  it  is  more  cost 
effective  than  Dijkstra's  method   if   (q-l)logq  <  log  n.   In  general, 
the  generalized  Jacobi  method  requires  a  fraction  of  n  iterations.   Let 
k  represent  the  number  of  Jacobi  iterations  required  for  a  given  prob- 
lem.  Then  if  k(q-l)log  q  <  n  log  n,  the  generalized  Jacobi  method  is 
the  superior  algorithm.   A  similar  tradeoff  between  Dijkstra's  method  and 
a  version  of  Ford's  method  was  found  by  Hulme  and  Wisniewski  [14]  for 
sequential  algorithms. 
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